manhattan_rides_df <- read_csv("manhattan_rides.csv")
manhattan_rides_df <-
manhattan_rides_df %>%
mutate(
day_of_week = factor(day_of_week, ordered = T,
levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")),
year = factor(year),
age_group = factor(age_group, ordered = T,
levels = c("18-25","26-35", "36-45", "46-55", "56-65", "66-85"))
)
manhattan_rides_df %>%
group_by(age_group) %>%
summarize(min = min(age), max = max(age), obs = n())
## # A tibble: 6 x 4
## age_group min max obs
## <ord> <dbl> <dbl> <int>
## 1 18-25 18 25 35068
## 2 26-35 26 35 102948
## 3 36-45 36 45 56694
## 4 46-55 46 55 43430
## 5 56-65 56 65 26232
## 6 66-85 66 85 6734
manhattan_rides_df %>%
group_by(day_of_week, year) %>%
summarize(obs = n()) %>%
ggplot(aes(x = day_of_week, y = obs, group = year, color = year)) +
geom_point() +
geom_line()

Fewer rides during the week in 2020 (presumably because of WFH), but more rides on the weekends (presumably because people avoid subway/ ubers)
manhattan_rides_df %>%
group_by(start_date, year) %>%
summarize(obs = n()) %>%
ggplot(aes(x = start_date, y = obs, group = year, color = year)) +
geom_line() +
geom_smooth(se = FALSE)

Not that helpful, but not a meaningful difference in numbers of rides between 2019 and 2020 except maybe March/ April where there appears to be a slight dip
manhattan_rides_df %>%
group_by(year) %>%
mutate(
month = month(starttime, label = T)
) %>%
filter(tripduration < 1500) %>%
ggplot(aes(x = month, y = tripduration)) +
geom_boxplot() +
facet_grid(~year)

#Plotly Version
manhattan_rides_df %>%
group_by(year) %>%
mutate(
month = month(starttime, label = T)
) %>%
filter(tripduration < 2500) %>%
plot_ly(
x = ~month,
y = ~trip_min,
color = ~year,
type = "box") %>%
layout(
boxmode = "group",
title = "Duration of Citibike Rides by Month",
xaxis = list(title = "Month"),
yaxis = list(title = "Trip Duration in Minutes")
)
Looks like maybe the overall length of trips in 2019 was more consistent. 2020 had a bump in duration of rides, starting in April. Overall, trip length seems more variable in 2020.
manhattan_rides_df %>%
group_by(year) %>%
mutate(
month = month(starttime, label = T)
) %>%
group_by(year, month) %>%
summarise(obs = n()) %>%
ggplot(aes(x = month, y = obs, group = year, color = year)) +
geom_line()

#Updated to plotly
manhattan_rides_df %>%
group_by(year) %>%
mutate(
month = month(starttime, label = T)
) %>%
group_by(year, month) %>%
summarise(obs = n()) %>%
plot_ly(
x = ~month,
y = ~obs,
color = ~year,
type = "scatter",
mode = "lines") %>%
layout(
title = "Number of Citibike Rides per Month",
xaxis = list(title = "Month"),
yaxis = list(title = "Rides")
)
Huge drop in monthly trips in April 2020. Lockdown started mid/late March so this coincides with people transitioning to WFH and largely staying inside to minimize contacts. The ride numbers bounce back quite a bit after this but not to 2019 levels.